A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Author: Jokić Danka
Krstev Cvetana
Stanković Ranka
Publication venue: OASIcs - OpenAccess Series in Informatics. 3rd Conference on Language, Data and Knowledge (LDK 2021)
Publication date: 01/01/2021

Abusive speech in social media, including profanities, derogatory speech, and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as using some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, derogatory language, etc. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present an abusive speech lexicon structure and its enrichment with abusive triggers extracted from the AbCoSER dataset.

Dagstuhl Research Online Publication Server

The Dictionary of the Serbian Academy: from the Text to the Lexical Database

Author: Krstev Cvetana
Sabo Olga
Stanković Ranka
Stijović Rada
Vitas Duško
Publication venue: Ljubljana : Ljubljana University Press, Faculty of Arts
Publication date: 01/01/2018

In this paper we discuss the project of digitizing the Dictionary of the Serbo-Croatian Standard and Vernacular Language. Scanning and character recognition were a particular challenge, since various non-standard character set encodings were used over the almost 60 years of the dictionary's production. The first aim of the project was to formalize the micro-structure of the dictionary articles in order to parse the digitized text and transform it into structured data stored in a relational lexical database. This approach is compatible with several standard structured forms and ontologies (TEI, LMF, Ontolex, LexInfo). A lexical database model was designed in compliance with these structured forms, following mostly the lemon model. Mapping the lexical entry markers to LexInfo and TEI enabled export of the lexical data to these formats. A software solution for dictionary text analysis, parsing, and lexical database population was developed and tested on the first and the last published volumes of the dictionary (which contain 27,141 articles in total). An evaluation of the results shows that the developed model and software solution can be successfully used for the other volumes as well.

Serbian Academy of Science and Arts Digital Archive (DAIS)

SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian

Author: Krstev Cvetana
Marković Aleksandra
Stanković Ranka
Stijović Rada
Vitas Duško
Šandrih Branislava
Publication venue: Brno : Lexical Computing CZ s.r.o.
Publication date: 01/01/2019

In this paper we present a model for the selection of good dictionary examples for Serbian and the development of initial model components. The method is based on a thorough analysis of various lexical and syntactic features in a corpus compiled from examples in the five digitized volumes of the Serbian Academy of Sciences and Arts (SASA) dictionary. The initial set of features was inspired by a similar approach for other languages. The feature distribution of examples from this corpus is compared with the feature distribution of sentence samples extracted from corpora comprising various texts. The analysis showed that there is a group of features which are strong indicators that a sentence should not be used as an example. The remaining features, including detection of non-standard and other marked lexis from the SASA dictionary, are used for ranking. The selected candidate examples, represented as feature vectors, are used with the GDEX ranking tool for Serbian candidate examples and with a supervised machine learning model for classifying Serbian sentences as standard or non-standard, for further integration into a solution for present and future dictionary production projects.

Serbian Academy of Science and Arts Digital Archive (DAIS)

Thresholds to the "Great Unread"

Author: Arias Rosario
Berenike Herrmann J.
Galleron Ioana
Krstev Cvetana
Mihurko Poniž Katja
Odebrecht Carolin
Patraş Roxana
Yesypenko Dmytro
Publication date: 29/10/2021

Repository of University of Nova Gorica

Multiword expressions: Insights from a multi-lingual perspective

Author: Barbu Mititelu Verginica
Bargmann Sascha
Dimitrova Tsvetana
El Marouf Ismail
Fotopoulou Aggeliki
Giouli Voula
Hanks Patrick
Koeva Svetla
Krstev Cvetana
Kuiper Koenraad
Kyriacopoulou Tita
Laporte Éric
Leseva Svetlozara
Markantonatou Stella
Martineau Claude
Nevado Llopis Almudena
Oakes Michael
Osenova Petya
Parra Escartín Carla
Sailer Manfred
Samaridi Niki
Simov Kiril
Sánchez Martínez Eoghan
Vitas Duško
Publication venue: Language Science Press
Publication date: 16/10/2017

Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the application of the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but there is little comparative work. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as the classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.

Language Science Press